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Summary 

We argue that the time from the onset of infectiousness to infectious contact, which we call the contact 
interval, is a better basis for inference in epidemic data than the generation or serial interval. Since contact 
intervals can be right-censored, survival analysis is the natural approach to estimation. Estimates of the 
contact interval distribution can be used to estimate R in both mass-action and network-based models. 
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1. Introduction 

Infectious disease remains one of the greatest threats to human health and commerce, and the analysis of 
epidemic data is one of the most important applications of statistics in public health. Some of the most im- 
portant questions involve the basic reproductive number, Rq, the number of secondary infections caused 
by a typical infectious person in the early stages of an epidemic (Diekmann and Heesterbeek, 2000). 
Higher values of Rq indicate that an epidemic will be larger and harder to control. The effects of inter- 
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ventions and the depletion of the susceptible population can be captured with the effective reproductive 
number R(t), which is the number of secondary infections caused by a typical person infected at time t. 

The generation interval of an infectious disease is the time between the infection of a secondary 
case and the infection of his or her infector. The serial interval is the time between the symptom onset 
of a secondary case and the symptom onset of his or her infector. The generation and serial interval 
distributions are often considered characteristic features of an infectious disease (Fine, 2003). For a given 
Ro, a shorter mean serial or generation interval implies faster spread of the epidemic. 

Usually, generation intervals are times between unobserved events. Serial intervals, which are times 
between observed events, are often used instead. Recent analyses of several past, emerging, and potentially 
emerging infectious diseases have been based on serial interval distributions, including the 1918 influenza 
(Mills et al., 2004), Severe Acute Respiratory Syndrome (SARS) (Lipsitch et al., 2003; Wallinga and Te- 
unis, 2004), pandemic influenza A(H1N1) (Fraser et al., 2009; McBryde et al., 2009), and avian influenza 
(Ferguson et al., 2005, 2006). Three methods form the basis of these applications. With a measurement 
of the exponential growth rate at the beginning of an epidemic and a known serial interval distribution, 
Rq can be estimated via the Lotka-Euler equation (Diekmann and Heesterbeek, 2000; Svensson, 2007; 
Wallinga and Lipsitch, 2007; Roberts and Heesterbeek, 2007). Two other methods use the time series of 
symptom onset times, assuming that all infections are symptomatic and observed. Wallinga and Teunis 
(2004) estimate R(t) given a known serial interval distribution. Their approach has been adapted by other 
researchers (Cauchemez et al., 2006), often supplemented with serial-interval observations from contact- 
tracing data. White and Pagano (2008) jointly estimate Rq and the serial interval distribution using a 
branching-process approximation to the initial spread of infection, assuming the number of secondary 
cases generated by each infectious person has a Poisson distribution with mean Rq. 
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There are several problems with estimators based on generation or serial intervals in the context of an 
emerging infection. The Lotka-Euler and Wallinga-Teunis estimators rely on a previously known genera- 
tion serial interval distribution. The Wallinga-Teunis and White-Pagano estimators assume that all serial 
intervals are independent and identically distributed, which occurs only if the incubation and infectious 
periods are constant. All three of these estimators assume a stable serial interval distribution, which lim- 
its their use to the early spread of infection. When multiple infectious persons compete to infect a given 
susceptible, the infector is the one who first makes infectious contact. Thus, the mean generation and 
serial intervals contract as the prevalence of infection increases either locally (e.g., within households) or 
globally (Svensson, 2007; Kenah et al., 2008). 

In this paper, we outline an alternative analysis of epidemic data that applies methods from survival 
analysis to contact intervals. Informally, the contact interval from an infectious person i to a susceptible 
person j is the time between the onset of infectiousness in i and the first infectious contact from i to j, 
where we define infectious contact to be a contact sufficient to infect a susceptible individual. This interval 
will be right-censored if j is infected by someone else prior to infectious contact from i or if i recovers 
from infection before making infectious contact with j. The contact interval is similar to the generation 
interval, except that its definition is not limited to contacts that actually cause infection and it begins with 
the onset of infectiousness rather than infection. 

Here, we focus on the analysis of completely-observed "Susceptible-Exposed-Infectious-Recovered" 
(SEIR) epidemics. The SEIR framework applies to acute, immunizing diseases that spread from person to 
person, such as measles, influenza, smallpox, and polio. We also assume that the epidemic is completely 
observed, so all cases are detected and their times of infection, onset of infectiousness, and recovery are 
observed. Most epidemics are only partially observed, so we plan to explore the analysis of more realistic 
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data sets in future papers. However, it is best viewed as a missing data problem, which requires that the 
methods for complete data be established. 

In Section 2, we define a general stochastic SEIR epidemic model and show that survival likelihoods 
for a vector of contact interval distribution parameters have score processes that are zero-mean martin- 
gales at the true parameter 6$. In Section 3, we show how estimates of the contact interval distribution 
can be used to estimate Rq in mass-action and network-based models. In Section 4, we evaluate the per- 
formance of these methods in simulated epidemic data and show that assumptions about the underlying 
contact process play a crucial role in accurate statistical inference. In Section 5, we discuss the advantages 
and limitations of survival methods in epidemic data analysis. 

2. Methods 

In this section, we show that the score processes from survival likelihoods for epidemic data can be written 
as stochastic integrals with respect to zero-mean martingales. We develop our methods in three stages. 
First, we describe the underlying stochastic SEIR model and the observed data. Second, we imagine that 
we observe who-infected-whom and derive counting-process martingales for an ordered pair ij and for a 
fixed susceptible j. Finally, we consider the situation where we do not observe who-infected-whom and 
derive counting-process martingales for a fixed susceptible j and for the complete observed data. Our 
sources for the underlying theory are Kalbfleisch and Prentice (2002) and Serfling (1980). 



2. 1 Stochastic SEIR model and observed data 

Consider a stochastic "Susceptible-Exposed-Infectious-Removed" (SEIR) model in a closed population 
of n individuals assigned indices 1, . . . ,n. Each person i moves from S to E at his or her infection time ti, 
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with ti = oo if i is never infected. After infection, i begins a latent period of length e< during which he 
or she is infected but not infectious. At time ti + Ei, i moves from E to I, beginning an infectious period 
of length ti. At time ti + T"», i recovers from infection and moves from I to R, where the recovery period 
r. L = Ei + ^ is the total time between infection and removal. Once in R, % can no longer infect other 
persons or be infected. The latent period is a nonnegative random variable, the infectious and recovery 
periods are strictly positive random variables, and the recovery period is finite with probability one. 

After becoming infectious at time ti + Ei, person i makes infectious contact with person j ' ^ i at their 
infectious contact time tij = ti + Ei + t*-, where the infectious contact interval t*- is a strictly positive 
random variable with r,* = oo if infectious contact never occurs. Since infectious contact must occur 
while i is infectious or never, r*j G (0, Li] or T*j — oo. We define infectious contact to be sufficient to 
cause infection in a susceptible person, so tj ^ tij with equality if and only if j is susceptible at time tij. 

An epidemic begins with one or more persons infected from outside the population, which we call 
imported infections. For simplicity, we assume that epidemics begin with one or more imported infections 
at time t = and there are no other imported infections. 

Contact intervals For each ordered pair ij, let dj = 1 if infectious contact from i to j is possible and 
Cij = otherwise. We assume that the infectious contact interval r^* is generated in the following way: 
A contact interval is drawn from a distribution with hazard function Xij(r). If r,j $5 Li and Cij = 1, 
then T*j = Tij. Otherwise r*, = oo. In this paper, we assume all contact intervals have an absolutely 
continuous distribution and, for a fixed i or a fixed j, the contact intervals Tij, i ^ j, are independent. 

Susceptibility and infectiousness processes Let Si(t) = and ij(t) = lfefa+e^ti+ri] be the sus- 

ceptibility and infectiousness processes, respectively, for person i, where lx = 1 if X is true and zero 
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otherwise. As denned, both processes are left-continuous and infectious contact from i to j is possible at 
time t only if C i3 Ii(t)Sj(t) = 1. 

Complete observed data Our population has size n, and m represents the number of infections we ob- 
serve. Observation begins at time t = and ends at time t = T. During this period, we observe the times 
of all S — > E (infection), E —> I (onset of infectiousness), and I — > R (recovery) transitions that occur in 
the population. For all ordered pairs ij, we observe CV, and any covariates Xy needed to specify Ajj (r) 
up to an unknown parameter vector 9 with true value #o- 



2.2 Score processes when who-infects-whom is observed 

Choose an ordered pair ij and let Nij(t) — lf^ti+ei+ru count the number of infectious contacts from i 
to j on or before time t. We count only the first infectious contact because j is infected on or before that 
time. Consider the filtration 

We assume that iV.y(0) = and Ajj(r) is predictable with respect to Hf , so 

M ij (t) = N ij (t)- f Xijiu-ti-sJCijl^Sji^du (2.1) 
Jo 

is a zero-mean martingale with respect Wf . Now suppose Ay (r) is specified up to a parameter vector 
9 with true value 9q, so Ajj(r) = Ay (t; #o). If the pair ij is observed from time until time T, the 
corresponding log likelihood is 

&ij(8)= / \nX i j{u-t i -£ i ;8)dNij{u)- j Ay(u-tj- €^6)0^1^) Sj(u)du. 
Jo Jo 
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If In Xij (t; 9) is differentiable with respect to 9 and we can interchange the order of differentiation and 
integration, the score process for data in the time interval [0, t] is 

Ua(jS,t)=J ^ln\ ij (u-t i -£ i ;e)dM ij (6,u), (2.2) 

where 

Mij(6, u) = Nij(t) - / Xij(u - U - Ei\ 9)CijIi{u)Sj{u) du. 
Jo 

Therefore, Uij(0o,t) is a zero-mean martingale because it is the integral of a predictable process with 
respect to M l} (0 O , t). When C ?J = 0, we have My (9, t) = U t3 (9, t) = for all 9 and t. 

Now fix j and assume there exist covariates Xij such that Ay(r;0) = X(t;9, Xij) for all i ^ j. 
For each i ^ j, assume iVy(O) = and Ajj(r) is predictable with respect to Hf . Since the contact 
intervals Tij are independent for a fixed j and absolutely continuous, the Mij(6o, r) from equation (2.1) 
are orthogonal zero-mean martingales with respect to the filtration 

Ul = a(Nij(u), Ii(u),Sj(u) :0^u^t,i^j). 

The total score process for j is 

U.j{6,t)=Y J U i j{e,t), (2.3) 

and U.j (9o, t) is a zero-mean martingale with respect to "H/ because it is a sum of zero-mean martingales. 
The score process in equation (2.3) is that of a survival likelihood where the tij are failure times and 
CijIi(t)Sj(t) = 1 indicates risk of infectious contact in the ordered pair ij. At the earliest infectious 
contact, the contact intervals in all remaining pairs at risk are right-censored, which is a type II independent 
censoring mechanism (Kalbfleisch and Prentice, 2002). 
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2.3 Score processes when who-infects-whom is not observed 

In the previous section, U.j(0, t) is adapted only if we observe which of the Nijit) jumps first, which is 
equivalent to observing the infector of person j. Now suppose that we observe the infection time of j but 
not which person i was the infector. This is equivalent to observing N.j(t) = Yl,i=tj Jo ^ii u ) dNij(u), 
which counts the first infectious contact received by j. The corresponding filtration is 

U'i = a(N. j (u),I i (u),S j (u) :0^u^ t,i^j), 

and the corresponding zero-mean counting process martingale is M.j(8 , i), where 

M.j (6,t) = N.j (t) - j X.j (u; 0)Sj (u) du (2.4) 



o 



and \.j(t; 6) = ~ k — £ ^ @> Xij)djli(t). We can no longer calculate U.j(0, t) as defined in 

equation (2.3), but we can calculate its conditional expectation given H t 3 . For each ij, define the expected 
score process 

Uij(0,t) = J ^\n\(u-t i -e^e,X ij )E[dM i3 {9,u)\ki]. (2.5) 
Given that N.j jumps at time t, the probability that the jump occurred in Nj is 



Px(dN i j(t) = i\dN.j(t) = i,e,n t J ) 

\.j{t;V) 

Thus, 



E[dMij(6, u)\H t J ] = -i , V m ' dN. 3 {u) - \{u -U - e t ; 9, JT^JCy /<(«)«,-(«) du, 

A.j(u; V) 

and equation (2.5) can be rewritten 

u ij{v,t)= — r~? — a\ dN.j{u)~ —Xlu-ti-e^ejXijjCijlilujSjiujdu. 

Jo \.j{u;V) J OV 
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Therefore, the expected score process for person j is 



U. 3 (9, t) = J2 Uij (t) = J ^ In A.j («; 0) dM.j (0, «), (2.6) 
which is the score process of the of the log likelihood 

£.j(6)= [ \n\. j (u;6)dN. j (u) - [ \. J (u;8)S J (u) du. (2.7) 
Jo Jo 

U.j(6o, t) is a zero-mean martingale with respect to H' t J because it is the integral of a predictable process 

with respect to M.j(8o,t). For an imported infection j, U.j(6, t) = for all 6 and all t £ [0, T], 

Finally, consider the filtration 

H t = a(N.j(u),Ij(u),Sj(u) : < u < t,j = l,...,n) 

generated by the complete data described at the end of Section 2.1. Since we assume that the ry, j ^ i, 
are independent for a fixed i and absolutely continuous, the M.j(9 , t) from equation (2.4) are orthogonal 
zero-mean martingales with respect to T-Lf The total expected score process is 

n 

U(6,t) = J2u. j (0,t), (2.8) 
which is the score process for the log likelihood 

n 

£(6)=J2Z-j(0)- (2.9) 

3=1 

U (9o, t) is a zero-mean martingale with respect to "Ht because it is a sum of zero-mean martingales. The 
maximum likelihood estimate (MLE) for 9 is the solution to the equation U (9, T) = 0. 
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2.4 Asymptotic distribution of 8 

In this section, we show that the variance of U (9o, t) can be estimated using its predictable and optional 
variation processes, which are unbiased estimators of the Fisher information from the survival likelihood. 
We then use the Lindeberg-Feller Central Limit Theorem to give a heuristic justification for standard 
maximum likelihood estimation with epidemic data. Throughout this section, we assume that A(t; 9, X) 
has a bounded second derivative in 6 and that integration and differentiation can be interchanged. 
Taking the derivative of U.j (9, t) with respect to 6 in equation (2.6) leads to 

-^U.j(e,t) = J w ]n\. j (u;6)dM. j (e,u)-J Q [— In X.^u; 0)][— In A. 3 -(«; 6)] T X. j (u; 9)S,(u) du. 

Setting 9 = 9 n makes the first term the integral of a predictable process with respect to a zero-mean 
martingale. Therefore, 



E 



= E 


1 







[—\nX. j (u;e )}[-^lnX. j (u-,e )} T X. j (u-,e )S j (u)du 



(2.10) 



so the predictable variation process (U.j(9o))(t) is an unbiased estimator of "Var[U.j(9o, t)]. By equation 
(2.8) and orthogonality of the U.j (8q, t), the total predictable variation process (U (9 ))(t) — (U.j (0o))(t) 
is an unbiased estimator of Var[C/ (9q, t)]. 

To show that the same result holds for the optional variation process, rearrange equation (2.6) to get 

U.j(9,t)=J ^lnX. j (u;6)dN. j - J ^X. 3 (u;9)S J (u) du. 

Taking the derivative with respect to 9 yields 

-I*"* ^ = f [ l ^ A ^ [ l ^ ^ ^ 6)]T dK > {U) ~[%^ dM ^ U) - 
Setting 9 = 9q makes the second term the integral of a predictable process with respect to a zero-mean 
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martingale. Therefore, 
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E 



= E 


[L 







' In A.j («; e )} [|- In X. 3 («; O )] T dMj («) 



(2.11) 



so the optional variation process [U.j(6o)](t] is an unbiased estimator of War[U.j(0o,t)] and the total 
optional variation process [U(9o)](t) = J2jW-j(9o)]{t) is an unbiased estimator of Vax[U(6o, t)}. 

Imagine a series of epidemics in larger and larger populations, and assume that the final sizes of the 
epidemics become infinite as the population size n —> oo. For any fixed T, the number of infections will 
not become infinite as n — » oo, which makes it difficult to apply the Martingale Central Limit Theorem 
to U(0o,T). Instead, imagine that we observe m„ infections in a population of size n between time 
and time T n , with m n —> oo as n —> oo. Let U n (8,T n ) be the corresponding total expected score 
process, and let 8 n be the corresponding MLE. If the Lindeberg condition holds for the triangular array 
U. 1 (e ,T n ),...,U. n (p ,T n ),fhBn 

U 2 {0M ; -h. N(Q,l) 
Var[U n {6 ,T n )]2 

in distribution as n — > oo by the Lindeberg-Feller Central Limit Theorem (Serfling, 1980). Heuristically, 
this justifies the use of maximum likelihood methods such as Wald, score, and likelihood ratio tests. 



3. Estimation of R 

The contact interval distribution can be used to estimate i?o in both network-based and mass-action mod- 
els. For simplicity, we assume that the hazard of infectious contact does not depend on covariates. Thus, 
Xij (t; 6) = A(t; 9) for all ij and the results in this section apply to homogeneous populations. For mass- 
action models, we describe an asymptotic likelihood that is valid for the initial spread of disease. 
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3.1 Network-based models 



In a network-based model, transmission takes place across the edges of a contact network, so we have 
C\j = 1 if and only if there is an edge leading from i to j in the contact network. Here, we will assume 
that contact networks are undirected, so Cy = Cji for all i and j. In a network-based model, i?o depends 
on the structure of the contact network. The most tractable models are those on configuration-model 
networks, which are maximally random except for their degree distribution (Molloy and Reed, 1995, 1998; 
Newman et al., 2002). More formally, let D be a nonnegative discrete random variable with finite mean 
and variance. To construct a configuration-model network with n nodes, assign each node i = 1, . . . , n a 
degree d, randomly sampled from the distribution of D. Then connect the stubs at random, erasing one 
stub if necessary so the sum of the degrees is even. As n — > oo, the probability of multiple edges between 
two nodes or loop from a node to itself goes to zero. 

In these networks, there is a straighforward definition of Ro (Andersson, 1998; Newman, 2002; Kenah 
and Robins, 2007). In the early stages of transmission, an infected node of degree d has d — 1 edges across 
which infection can be transmitted. The probability of transmitting infection across each of these edges 
is cxp(— A(i; 6*o)), where l is the infectious period and A(i; 0) = J Q A(u; 9) du. Since the probability 
of reaching a node by following edges is proportional to the degree of the node, the mean number of 
secondary infections generated by a typical infected node in the early stages of an epidemic is 



where the first expectation is taken over the distribution of the infectious period l. 

Network-based likelihood In a network-based model, the likelihood £(6) in equation (2.9) depends only 
on data about individuals who are either infected before time T or connected to an infected person in the 




(3.1) 
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contact network. In principle, these people could be identified through surveillance and contact tracing. 
For all other individuals j, U.j(6, t) = for all t E [0, T] because Cijli(t) = for all i. Since is 
the expected degree of persons who are infected by transmission within the population, it can be estimated 
by calculating the mean degree of persons who are infected. 



3.2 Mass-action models 

In a mass-action model, individuals form no stable social bonds and interact like gas molecules. Thus, 
C\j = 1 for all ij but the hazard of infectious contact is inversely proportional to the population size. If 
A n (r; 8) is the hazard function for the contact interval distribution in a population of size n, 

n — 1 

for a baseline hazard function Ao(t; 8) with corresponding cumulative hazard function A (t; 8). As be- 
fore, these functions are specified up to an unknown parameter vector 8 with true value 8$. 

The baseline hazard and cumulative hazard functions of a mass-action model have useful interpreta- 
tions in terms of Rq and the time course of infectiousness in the limit as n —> oo. Given an infectious 
period i, the expected number of infectious contacts made is 

Ro = (n - 1) (l - e -^r A °(''^)) — > Ao(t; Q ). (3.2) 

Given that i makes infectious contact with j and has infectious period t, the probability density function 
of the infectious contact interval from i to j is 

T ±- 1 X (r;8)e-^ A ^ X (r;8 a ) 

1 _ e -drrAo( t ;0o) Rq ' 1 ' > 
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Mass-action likelihood Let m be the total number of infections observed before time T. If m <C n, an 
approximate likelihood that depends only on information about infected presons can be written in terms 
of Ao(t; 9). Expanding equation (2.9) in terms of Ao(r; 0), we get 

n r T n r T 



f / \ f 

E / ln ( E v« - ** - e ^ dN 'j( u ) - E / ln ( n - : ) dN -^ u ) 

-4tE / (5^Ao(«-*i-ei;fl)-ri(«))5j(«)d«. (3.4) 



All summands in the first term are zero except for those j with tj ^ T. The second term is not a function 
of 9 and can be ignored. The third term can be split into terms from j who get infected on or before time 
T and from those who remain uninfected at time T: 

(E Aofttj -*,-£,) A y E AoKT-ti-eOAt,;*), 

where x Ay = min(a;, y). Since the first term is less than or equal to 

- ^((T-ti-eOAti^), 



r 



we have 

— »■ E ( ln (E A ofe - *< - - Ao((T - *j - £j) A tj -;0)) (3.5) 

for a fixed m as n — > oo. This asymptotic likelihood depends only on information about infected people. 
In principle, these people could be identified through surveillance. 



4. Simulations and illustration 

In this section, we first look at the performance of the methods from Sections 2 and 3 in simulated epi- 
demic data sets from mass-action and network-based models. We then illustrate the use of our methods 
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with an analysis of two epidemic curves from the early spread of influenza A(H1N1) in Mexico. 
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4. 1 Simulations 

In this section, we look at the performance of the methods from Sections 2 and 3 in simulated epidemic 
data sets. In all models, we used data from the first m = 1, 000 infections in a population of size n = 
100, 000. For each infected person i, we recorded the infection time ij, the onset of infectiousness ti + e,, 
and the recovery time ti + r». In network-based models, the degree di and the indices of all neighbors 
of i were also recorded. All outbreaks started with a single imported infection at time 0. Since we are 
interested primarily in the analysis of emerging epidemics, outbreaks that terminated with a final size less 
than 1,000 were discarded. If an epidemic model was run 100 times without producing an epidemic final 
size of at least 1,000, it was discarded and another model was generated. For all simulations, R was 
constrained to be between 1.01 and 16, a range that covers almost all known epidemic diseases. 

In network-based models, the contact networks were undirected Erdos-Renyi random graphs (New- 
man et al., 2006) with an expected degree chosen from the discrete uniform distribution on {2, . . . , 16}. 
A new contact network was constructed for each simulation. 

Four scenarios were considered within each class of model: exponential or Weibull (baseline) contact 
interval distributions with constant or exponentially-distributed infectious periods. All infectious period 
distributions had mean one. The exponential distribution has the hazard function A(r; /3) = (3 for all 
t > 0, where (3 > is the rate parameter. The Weibull distribution has the hazard function A(r; a, j3) = 
a/3(/3T) a_1 for all r > 0, where a > is the shape parameter and j3 > is the rate parameter. Note that 
the exponential distribution is a Weibull distribution with a = 1. 
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Parameter estimates For network-based models, we used the likelihood in equation (2.9) to estimate the 
parameters of the contact interval distribution. For mass-action models, we used the asymptotic likelihood 
in equation (3.5) to estimate the parameters of the baseline contact interval distribution. Maximum likeli- 
hood estimates were obtained using the mle function in the R library stats4. Confidence intervals for 
each parameter were calculated using the c o n f i nt function, which inverts the one-parameter likelihood 
ratio chi-squared test using a profile likelihood. 

Rq estimates For network-based models, Rq was estimated using equation (3.1). For mass-action mod- 
els, i?o was estimated using equation (3.2). We calculated bootstrap percentile confidence intervals by 
sampling contact interval distribution parameters from their approximate joint normal distribution and 
combining each sample with a bootstrap sample of the observed infectious periods (and, for network- 
based models, observed degrees in the contact network). The 95% confidence interval was defined by the 
2.5% and 97.5% quantiles of the point estimates from 10,000 samples. 

Implementation Simulations were implemented in Python 2.6 (www.python.org) using the SciPy 0.7 
package (Jones et al., 2009). Analyses were performed in R 2.10 (R Development Core Team, 2009) via 
the RPy 2.0 package (Moreira and Warnes, 2009). Contact networks were generated using the NetworkX 
0.99 package (Hagberg et al., 2008). Sampling from multivariate normal distributions was done using 
the Cholesky distribution of the covariance matrix (Rizzo, 2008). The simulation code is included as 
supplementary material (http://www.biostatistics.oxfordjournals.org). 

4.1.7 Mass-action models 

For mass-action models with exponential contact intervals, Rq = f3 for both fixed and exponentially- 
distributed infectious periods. Let f3 denote the MLE of the rate parameter /?, and let Lk denote the infec- 
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tious period of the k th infection observed. Our point estimate of Rq is 



£o = -Y>;. (4.1) 

fe=l 

A bootstrap sample of R is 

^S=-y]/3*4: (4-2) 

where /3* is a parametric bootstrap sample from the approximate normal distribution of f3 and i\ , . . . , 
is a bootstrap sample from the observed i\ , . . . , t m . 

For mass-action models with Weibull contact intervals i?o = fi a for a fixed infectious period and 
Ro = f3 a T(a + 1) for exponentially-distributed infectious periods. In both cases, 

rn 

#o = -V(/^r, (4-3) 
m z — ' 

fc=i 

where a is the shape parameter MLE and $ is the rate parameter MLE. A bootstrap sample of R is 

^ = -E^*^r*' (4 - 4) 

where (a*, /?*) is a sample from the approximate joint normal distribution of (a, j3). 



Results Table 1 shows the coverage probabilities achieved in 1,000 simulations and exact binomial 95% 
confidence intervals for the true coverage probabilities in each of the four types of mass-action model. 
Figure 1 shows a scatterplot of Rq versus Rq for models with exponential contact interval and infectious 
period distributions. Figure 2 shows a scatterplot of estimated versus true ln(i?o) for models with Weibull 
contact interval distributions and exponential infectious period distributions. For these models, estimates 
of Rq are right-skewed because of exponent a in equation (4.3); this is reduced by taking logarithms. 
Similar results were obtained in models with a fixed infectious period. 
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4.1.2 Network-based models 

Let ik and dk denote the infectious period and degree, respectively, of the fc th infection observed. In a 
contact network with n nodes, let D be the mean degree and let 

n 
i=l 

For network-based models with exponential contact intervals, Rq — (1— exp(— /3))-D for a fixed infectious 
period and Rq = for exponentially-distributed infectious periods. In both cases, 

m 

A) = -y;(i-e-^)(4-i). 

m L — ' 

fc=i 

A bootstrap sample of Rq is 

J2S = -E(l-e-^)(dJ-l), 
fc=i 

where /3* is a sample from the approximate normal distribution of (3 and (t* , ),■•■, (t^, , ) is a boot- 
strap sample from (<,]_, di), . . . , (t m , d m ). 

For network-based models with Weibull contact intervals, Rq = (1 — exp(— /3 a ))D for a fixed infec- 
tious period and 

/>oo 

JZo = 1- / e-^"- x dx. 
Jo 

for exponentially-distributed infectious periods. In both cases, 

i ?o = -E( 1 -^ ( ^ ) ")K- 1 )- 

TO * — ' 

fc=l 

A bootstrap sample of i?o is 

^ = -E( 1 - e_(rtD<, *)(^- 1 )' 

TO * — ' 

fc=l 

where (a* , f3*) is a sample from the approximate joint normal distribution of (a, (3) and (t*, d\), . . . , (i^, d£J 
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is a bootstrap sample from (ti, di), . . . , (t m ,d m ). 

Results Table 2 shows the coverage probabilities achieved in 1,000 simulations and exact binomial 95% 
confidence intervals for the true coverage probability in each of the four types of network-based model. 
Figure 3 shows a scatterplot of the estimated versus true i?o for models with exponential contact interval 
and infectious period distributions. Figure 4 shows a scatterplot of the estimated versus true Rq f° r models 
with Weibull contact interval distributions and exponential infectious period distributions. Similar results 
were obtained in models with a fixed infectious period. 

Mass-action estimates To look at the effect of assumptions about the contact process on statistical infer- 
ence during an epidemic, we applied the mass-action likelihoods to data generated by the network-based 
models, ignoring all information about the contact network. Table 2 shows the coverage probabilities 
achieved in 1,000 simulations and exact binomial 95% confidence intervals for the true coverage proba- 
bilities for mass-action estimates applied to network-based models. The '+' signs in Figures 3 and 4 show 
the mass action estimates of Rq versus the true Rq in network-based models with exponential infectious 
periods. Many of these points fall above the top edge of each graph. Similar results were obtained in 
models with a fixed infectious period. 



4.2 Illustration: Influenza A(H1N1 ) in Mexico, 2009 

To show the practicability of methods based on contact intervals as well as the importance of data that is 
uncollected or unreported in emerging epidemics, we attempted to estimate Rq based on two epidemic 
curves published at the beginning of the influenza A(H1N1) pandemic in Mexico. The first epidemic curve 
contains suspected cases in the village of Vera Cruz between March 9 and March 20 (Fraser et al., 2009). 
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The second epidemic curve contains lab-confirmed cases in Mexico City between April 13 and April 24 
(Ministry of Health, 2009). In both analyses, we assumed a latent period (between infection and the onset 
of infectiousness) of one day and an incubation period (between infection and onset of symptoms) of two 
days. With no data on links between cases or the duration of illness in each case, we assumed mass-action 
with a constant infectious period. Confidence intervals are generated as in the simulations. 

Assuming an exponential contact interval distribution, we get Rq — 1.95 (1.63, 2.33) for Vera Cruz 
and Rq = 2.31 (2.15, 2.48) for Mexico City. These are high but consistent with some early estimates 
(Fraser et al., 2009; Yang et al., 2009). Assuming a Weibull contact interval distribution, we get R = 
3.08 (2.55,3.65) for Vera Cruz and R Q = 4.37 (4.06,4.70) for Mexico City; in both cases, the null 
hypothesis of an exponential contact interval distribution is strongly rejected (likelihood ratio p-value 
< .001). The estimates are also sensitive to the assumed infectious period. Assuming a five-day infectious 
period and a Weibull contact interval distribution, we get R = 3.53 (2.79,4.30) for Vera Cruz and 
Rq = 7.14 (6.63, 7.66) for Mexico City. Subsequent experience shows that these Rq estimates are far 
too high. This bias is consistent with the results obtained above when applying mass-action estimates 
to simulated data generated by network-based models. Since a most influenza transmission takes place 
in households, workplaces, and schools (Yang et al., 2009) the true underlying transmission model is 
probably closer to a network-based model than a mass-action model. Data on the duration of illness 
and, more importantly, on the social links between cases would allow better point and interval estimates 
of Rq. The estimates could also be improved with incomplete-data methods that took into account the 
discreteness of the data and allowed variability in latent, incubation, and infectious periods. 
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5. Discussion 

The results of the simulations confirm that standard maximum-likelihood methods can be applied suc- 
cessfully to survival likelihoods written in terms of the contact interval distribution. In the mass-action 
models, performance deteriorated noticeably in moving from exponential to Weibull contact interval dis- 
tributions, possibly because U(6q,T) was closer to a normal distribution in the simpler models. No such 
deterioration was noticeable in the network-based models, possibly due to the addition of contact-tracing 
information. Our methods were deliberately simple: all point estimates were plug-in estimators and all 
confidence intervals were based on normal approximations for the joint distributions of the MLEs. More 
sophisticated methods, such as Bayesian methods, might produce point estimates and confidence inter- 
vals whose performance is even better. The methods here would adapt quite well to a Bayesian analysis, 
and we believe that a Bayesian framework is the most natural setting for the development of methods to 
analyze partially-observed epidemics. 

Methods based on contact intervals can incorporate a much greater variety of transmission models 
than methods based on generation or serial intervals, which usually assume mass-action. The simulation 
results presented above show that this flexibility is essential for accurate statistical inference during an 
epidemic. The mass action estimates failed spectacularly when applied to data generated by network- 
based models. The point estimates were severely biased upward, and all 95% confidence intervals had 
coverage probabilities below 85%, with most below 25%. 

The methods and simulation results in this paper have important implications for data collection during 
an emerging epidemic. First, they require information on the onset and duration of infectiousness. For 
an acute infectious disease, the onset and duration of illness may provide a useful proxy, especially if 
there is some knowledge of the incubation period and the pattern of pathogen shedding. Second, they 
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show the potential value of data about close contacts of cases, whether or not they are infected. Methods 
based on generation and serial intervals do not require such data, but this apparent advantage comes at 
a tremendous cost in terms of the flexibility and validity of the subsequent analysis. They are essentially 
missing-data methods with no complete-data counterparts, and they almost certainly understate the true 
data requirements for accurate estimation of i?o . 

Limitations The SEIR framework limits our methods to acute, immunizing diseases that spread person- 
to-person. It does not apply to many diseases of public health importance, such as tuberculosis, meningo- 
coccal or pneumococcal diseases, foodborne or waterborne diseases, or HIV/AIDS. Most (though not 
all) emerging infections fit into the SEIR framework, and almost all methods currently used to analyze 
data from emerging epidemics make this assumption. We also assumed that all times of infection, on- 
set of infectiousness, and recovery are observed. This is clearly unsatisfactory, but the development of 
incomplete-data methods must be based on complete-data methods. In Section 2, we assumed that the 
contact interval is independent of the infectious period a, of i. This simplified the likelihoods, but it is 
probably unrealistic. This problem could be addressed by including n as a covariate in Xij or by using 
multivariate survival methods. In Section 3, we assumed that the population is homogeneous. This simpli- 
fied the estimation of i?o, but it is also unrealistic. In a heterogeneous population, estimates of Rq would 
have to include the distribution of relevant covariates in the population. 

Despite these limitations, methods based on contact intervals and survival analysis have the potential 
to become important tools in infectious disease epidemiology. The purpose of this paper was to introduce 
survival analysis based on contact intervals as a useful complete-data method, and we have done so in the 
simplest setting possible. These methods can be seen as descendants of methods based on generation and 
serial intervals, but they are more flexible and more explicit about assumptions and data requirements. 
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Table 1. Coverage probabilities for mass-action models. 



Infectious period 
distribution 


Parameter 


Coverage probability 


Exact binomial 95% CI 


Constant 


P 
Ro 


Exponential contact interval 

.952 (.937, 
.950 (.935, 


.964) 
.962) 


Exponential 


P 
Ro 


.951 
.963 


(.936, 
(.949, 


.964) 
.974) 


Constant 


a 
ft 
Rq 


Weibull contact interval 

.936 (.919, 
.907 (.887, 
.879 (.857, 


.950) 
.924) 
.899) 


Exponential 


a 
P 
Ro 


.927 
.912 
.902 


(.909, 
(.893, 
(.882, 


.942) 
.929) 
.920) 



Table 2. Coverage probabilities for network-based models. 



Infectious period 
distribution 


Parameter 


Network-based estimates 
Coverage Exact binomial 
probability 95% CI 


Mass-action estimates 
Coverage Exact binomial 
probability 95% CI 








Exponential 


contact 


interval 






Constant 


P 


.942 


(.926, 


.956) 


.004 


(.001, 


.010) 




Ro 


.948 


(.932, 


.961) 


.210 


(.185, 


.237) 


Exponential 


ft 


.943 


(.927, 


.957) 


.035 


(.024, 


.048) 




Ro 


.962 


(.948, 


.973) 


.000 


(.000, 


.004) 








Weibull contact interval 








n 


.936 


(.919, 


.950) 


.798 


(.772, 


.822) 


Constant 


ft 


.946 


(.930, 


.959) 


.025 


(.016, 


.037) 




Ro 


.945 


(.929, 


.958) 


.509 


(.477, 


.540) 




a 


.946 


(.930, 


.959) 


.834 


(.809, 


.857) 


Exponential 


ft 


.950 


(.935, 


.963) 


.031 


(.021, 


.044) 




Ro 


.941 


(.925, 


.955) 


.392 


(.362, 


.423) 
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Fig. 1. Scatterplot of estimated versus true Ro for mass-action models with exponential contact interval and infectious 
period distributions, showing excellent agreement. Similar results were obtained in models with a fixed infectious 
period (not shown). 
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Estimated versus true ln(R ) 
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Fig. 2. Scatterplot of estimated versus true ln(i?o) for mass-action models with Weibull contact interval distributions 
and exponential infectious period distributions. Estimates are nearly unbiased at low Ro, but biased upward at high 
Ro. Similar results were obtained in models with a fixed infectious period (not shown). The smoothed mean was pro- 
duced with the R command lowess. One simulation that produced Ro — 3730.8 (1089.9, 12,245.8) was excluded 
from the graph; it had a true Ro = 10.5. 
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Fig. 3. Scatterplot of estimated versus true _Ro for network-based models with exponential contact interval and infec- 
tious period distributions, showing excellent agreement. The mass-action estimates are severely biased upward; most 
are out of range of the plot. Similar results were obtained in models with a fixed infectious period (not shown). 
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Fig. 4. Scatterplot of estimated versus true Ro for network-based models with Weibull contact interval distributions 
and exponential infectious period distributions, showing excellent agreement. The mass-action estimates are severely 
biased upward. Similar results were obtained in models with a fixed infectious period (not shown). 



